The latest generation of Web search tools is beginning to exploit hypertext link information
to improve ranking and crawling algorithms. The hidden assumption behind such approaches, a
correlation between the graph structure of the Web and its content, has not been tested
explicitly despite increasing research on Web topology. Here I formalize and quantitatively
validate two conjectures drawing connections from link information to lexical and semantic Web
content. The link-content conjecture states that a page is similar to the pages that link to it,
i.e., one can infer the lexical content of a page by looking at the pages that link to it. I also
show that lexical inferences based on link cues are quite heterogeneous across Web communities.
The link-cluster conjecture states that pages about the same topic are clustered together, i.e.,
one can infer the meaning of a page by looking at its neighbours. These results explain the
success of the newest search technologies and open the way for more dynamic and scalable methods
to locate information in a topic- or user-driven way.


All search engines basically perform two functions: (i) crawling Web pages to maintain an index,
and (ii) matching URLs in the index database against user queries. Effective search engines achieve
a high coverage of the Web, keep their index fresh, and rank hits in a way that correlates with the
user's notion of relevance. Ranking and crawling algorithms use cues from words and hyperlinks,
associated respectively with lexical and link topology. In the former, two pages are close to each
other if they have similar textual content; in the latter, if there is a short path between them.
Lexical metrics are traditionally used by search engines to rank hits according to their similarity
to the query, thus attempting to infer the semantics of pages from their lexical representation.
Similarity metrics are derived from the vector space model, which represents each document or
query by a vector with one dimension for each term and a weight along that dimension that
estimates the term's contribution to the meaning of the document. The cluster hypothesis behind
this model is that a document lexically close to a relevant document is also relevant with high
probability. Links have traditionally been used by search engine crawlers only in exhaustive,
centralized algorithms. However, the latest generation of Web search tools is beginning to integrate
lexical and link metrics to improve ranking and crawling performance through better models of
relevance. The best known example is the PageRank metric used by Google: pages containing the
query's lexical features are ranked using query-independent link analysis. Links are also used in
conjunction with text to identify hub and authority pages for a certain subject, determine the
reputation of a given site, and guide search agents crawling on behalf of users or topical search
engines.
To study the connection between link and lexical topologies, I conjecture a positive correlation
between distance measures defined in the two spaces. Given any pair of Web pages $(p_1, p_2)$ we
have well-defined distance functions $\delta_l$ and $\delta_t$ in link and lexical space, respectively. To compute
$\delta_l(p_1, p_2)$ we use the Web hypertext structure to find the length, in links, of the shortest path from
$p_1$ to $p_2$. (This is not a metric distance because it is not symmetric in a directed graph, but for
convenience I refer to $\delta_l$ as "distance".) To compute $\delta_t(p_1, p_2)$ we can use the vector representations
of the two pages, where the vector components (weights) of page $p$, $w_p^k$, are computed for terms $k$
in the textual content of $p$ given some weighting scheme. One possibility would be to use Euclidean
distance in this word vector space, or any other $L_z$ norm. However, $L_z$ metrics have a dependency
on the dimensionality of the pages, i.e., larger documents tend to appear more distant from each
other than shorter ones, irrespective of content. To circumvent this problem, one can instead
define a metric based on the similarity between pages. Let us use the cosine similarity function, a
standard measure in information retrieval:

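$$\sigma(p_1, p_2) = \frac{\sum_{k \in p_1 \cap p_2} w_{p_1}^k \, w_{p_2}^k}{\sqrt{\sum_{k \in p_1} \left(w_{p_1}^k\right)^2 \, \sum_{k \in p_2} \left(w_{p_2}^k\right)^2}} \qquad (1)$$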
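
The cosine similarity can be sketched in a few lines. This is only an illustration, assuming each page is stored as a sparse dictionary mapping terms to weights; the function name, the toy pages and the zero-vector convention are assumptions, not part of the original study.

```python
import math

def cosine_similarity(w1, w2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    shared = set(w1) & set(w2)  # only shared terms contribute to the dot product
    dot = sum(w1[k] * w2[k] for k in shared)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for pages with no terms
    return dot / (norm1 * norm2)

# Toy pages with made-up weights.
page_a = {"web": 2.0, "search": 1.0}
page_b = {"web": 1.0, "crawler": 3.0}

sim_ab = cosine_similarity(page_a, page_b)
sim_aa = cosine_similarity(page_a, page_a)

# Scaling all weights of a page (e.g. a longer copy of the same text)
# leaves the similarity unchanged, unlike a Euclidean distance.
page_a_long = {k: 10.0 * v for k, v in page_a.items()}
sim_long = cosine_similarity(page_a_long, page_b)
```

The last lines illustrate why a similarity-based measure avoids the document-length dependency mentioned above.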



According to the link-content conjecture, $\sigma$ is anticorrelated with $\delta_l$. The idea is to measure the
correlation between the two distance measures across pairs of pages. Figure 1 illustrates how a
collection of Web pages was crawled and processed for this purpose.
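
Since the link distance is a shortest-path length in a directed graph, it can be computed by breadth-first search. The sketch below assumes the link graph fits in memory as a dictionary from each page to the list of pages it links to; the function name and the toy graph are hypothetical.

```python
from collections import deque

def link_distance(graph, source, target):
    """Length, in links, of the shortest directed path from source to target.

    Returns None when target is unreachable from source.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Toy directed cycle: a -> b -> c -> a.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"]}

d_ac = link_distance(toy_graph, "a", "c")  # follows a -> b -> c
d_ca = link_distance(toy_graph, "c", "a")  # follows c -> a
```

In the toy cycle the two directions give different lengths, which is the asymmetry that prevents the link distance from being a true metric.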

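The link distances $\delta_l(q,p)$ and similarities $\sigma(q,p)$ were averaged for each topic $q$ over all pages $p$ in
the crawl set $P_d^q$ for each depth $d$:

$$\delta(q,d) \equiv \langle \delta_l(q,p) \rangle_{P_d^q} = \frac{1}{N_d^q} \sum_{i=1}^{d} i \cdot \left(N_i^q - N_{i-1}^q\right) \qquad (2)$$

$$\sigma(q,d) \equiv \langle \sigma(q,p) \rangle_{P_d^q} = \frac{1}{N_d^q} \sum_{p \in P_d^q} \sigma(q,p). \qquad (3)$$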
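
As a rough illustration of the two averages, the sketch below assumes that the cumulative crawl-set sizes are available as a list indexed by depth (with index 0 holding the size at depth 0), and that each depth $i$ is weighted by the number of pages first reached at that depth. The function names and the numbers are made up for the example.

```python
def mean_link_distance(cumulative_sizes, d):
    """Average link distance over the cumulative crawl set at depth d.

    cumulative_sizes[i] is the number of pages within i links of the topic
    page; pages first reached at depth i each contribute a distance of i.
    """
    total = sum(
        i * (cumulative_sizes[i] - cumulative_sizes[i - 1])
        for i in range(1, d + 1)
    )
    return total / cumulative_sizes[d]

def mean_similarity(similarities):
    """Plain average of the page-to-topic similarities in the crawl set."""
    return sum(similarities) / len(similarities)

# Hypothetical crawl: 1 topic page, then 10, 100 and 1000 new pages per depth.
sizes = [1, 11, 111, 1111]
avg_distances = [mean_link_distance(sizes, d) for d in (1, 2, 3)]
avg_sim = mean_similarity([0.4, 0.2, 0.1, 0.1])
```

Because the crawl set grows quickly with depth, the averaged distance at depth $d$ stays close to $d$ in this toy example.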



The 300 measures of $\delta(q,d)$ and $\sigma(q,d)$ from Equations 2 and 3 are shown in Figure 2. The two
metrics are indeed well anticorrelated and predictive of each other with high statistical significance.
This quantitatively confirms the link-content conjecture.
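
The anticorrelation is quantified with Pearson's correlation coefficient. A minimal sketch of that computation follows; the data points are invented for illustration and are not the measurements reported here.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up averages: similarity drops linearly as link distance grows.
toy_distances = [1.0, 2.0, 3.0]
toy_similarities = [0.5, 0.3, 0.1]
rho = pearson(toy_distances, toy_similarities)
```

A value near $-1$ indicates strong anticorrelation; the toy data are perfectly linear, so they sit at that extreme.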

To analyze the decrease in the reliability of lexical content inferences with distance from the topic
page in link space one can perform a nonlinear least-squares fit of these data to a family of
exponential decay models:

$$\sigma(\delta) \sim \sigma_\infty + (1 - \sigma_\infty) \, e^{-\alpha_1 \delta^{\alpha_2}} \qquad (4)$$

using the 300 points as independent samples. Here $\sigma_\infty$ is the noise level in similarity. Note that
while starting from Yahoo pages may bias $\sigma(\delta < 1)$ upward, the decay fit is most affected by the
constraint $\sigma(\delta = 0) = 1$ (by definition of similarity) and by the longer-range measures $\sigma(\delta > 1)$.
The similarity decay fit curve is also shown in Figure 2. It provides us with a rough estimate of
how far in link space one can make inferences about lexical content.
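
To make the decay model concrete, the sketch below evaluates it and recovers its two parameters from synthetic data. It uses a coarse grid search over candidate parameter values as a stand-in for a proper nonlinear least-squares routine; the grids, the noise level and the "true" parameters are arbitrary choices for illustration, not the estimates obtained in this study.

```python
import math

def similarity_decay(delta, sigma_inf, a1, a2):
    """Exponential decay model for similarity as a function of link distance."""
    return sigma_inf + (1.0 - sigma_inf) * math.exp(-a1 * delta ** a2)

def sum_squared_error(points, sigma_inf, a1, a2):
    """Sum of squared residuals between observed and modelled similarity."""
    return sum(
        (sim - similarity_decay(delta, sigma_inf, a1, a2)) ** 2
        for delta, sim in points
    )

def grid_fit(points, sigma_inf, a1_grid, a2_grid):
    """Pick the (a1, a2) pair on the grid with the smallest squared error."""
    candidates = ((a1, a2) for a1 in a1_grid for a2 in a2_grid)
    return min(candidates, key=lambda ab: sum_squared_error(points, sigma_inf, *ab))

# Synthetic, noise-free points generated from arbitrary parameters.
noise_level = 0.03
points = [
    (d, similarity_decay(d, noise_level, 1.5, 0.5))
    for d in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
]
best = grid_fit(points, noise_level, [0.5, 1.0, 1.5, 2.0, 2.5], [0.25, 0.5, 0.75, 1.0])
```

The model satisfies the constraint of unit similarity at zero distance for any parameter values, which is why only two parameters need to be fitted.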

How heterogeneous is the reliability of lexical inferences based on link neighbourhood across
communities of Web content providers? To answer this question the crawled pages were divided up into
connected sets within top-level Internet domains. The scatter plot of the $\delta(q,d)$ and $\sigma(q,d)$
measures for these domain-based crawls is shown in Figure 3a. The plot illustrates the heterogeneity
in the reliability of lexical inferences based on link cues across domains. The parameters obtained
from fitting each domain's data to the exponential decay model of Equation 4 (Figure 3b) estimate
how reliably links point to lexically related pages in each domain. A summary of the statistically
significant differences among the parametric estimates is shown in Figure 3c. It is evident that, for
example, academic Web pages are better connected to each other than commercial pages in that
they do a better job at pointing to other similar pages. In other words, it is easier to find related
pages browsing through academic pages than through commercial pages. This is not surprising
considering the different goals of the two communities.



The link-cluster conjecture is a link-based analog of the cluster hypothesis, stating that pages within 
a few links from a relevant source are also relevant with high probability. Here I experimentally 
assess the extent to which relevance is preserved within link space neighbourhoods, and the decay 
in expected relevance as one browses away from a relevant page. 



The link-cluster conjecture has been implied or stated in various forms. One can most
simply and generally state it in terms of the conditional probability that a page $p$ is relevant with
respect to some query $q$, given that page $r$ is relevant and that $p$ is within $d$ links from $r$:

$$R_q(d) = \Pr[\mathrm{rel}_q(p) \mid \mathrm{rel}_q(r) \wedge \delta_l(r,p) \le d] \qquad (5)$$

where $\mathrm{rel}_q()$ is a binary relevance assessment with respect to $q$. In other words, a page has a higher
than random probability of being about a certain topic if it is in the neighbourhood of other pages
about that topic. $R_q(d)$ is the posterior relevance probability given the evidence of a relevant page
nearby. The simplest form of the link-cluster conjecture is stated by comparing $R_q(1)$ to the prior
relevance probability $G_q$:

$$G_q = \Pr[\mathrm{rel}_q(p)] \qquad (6)$$

also known as the generality of the query. If link neighbourhoods allow for semantic inferences,
then the following condition must hold:

$$\lambda(q, d=1) = \frac{R_q(1)}{G_q} > 1. \qquad (7)$$

To illustrate the meaning of the link-cluster conjecture, consider a random crawler (or user)
searching for pages about a topic $q$. Call $\eta_q(t)$ the probability that the crawler hits a relevant page at
time $t$. Solving the recursion

$$\eta_q(t+1) = \eta_q(t) \cdot R_q(1) + (1 - \eta_q(t)) \cdot G_q \qquad (8)$$

for $\eta_q(t+1) = \eta_q(t)$ yields the stationary hit rate

$$\eta_q^* = \frac{G_q}{1 + G_q - R_q(1)}. \qquad (9)$$

The link-cluster conjecture is a necessary and sufficient condition for such a crawler to have a better
than chance hit rate, thus justifying the crawling (and browsing!) activity:

$$\eta_q^* > G_q \Leftrightarrow \lambda(q, 1) > 1. \qquad (10)$$
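
The random-crawler argument can be checked numerically. The sketch below iterates the recursion for the hit rate and compares the result with the closed-form stationary value; the probabilities used are arbitrary illustrative numbers, not estimates from the crawl data, and the function names are hypothetical.

```python
def stationary_hit_rate(r1, g):
    """Closed-form fixed point of the hit-rate recursion."""
    return g / (1.0 + g - r1)

def iterate_hit_rate(r1, g, eta0=0.0, steps=50):
    """Iterate eta(t+1) = eta(t) * R(1) + (1 - eta(t)) * G from eta0."""
    eta = eta0
    for _ in range(steps):
        eta = eta * r1 + (1.0 - eta) * g
    return eta

# Illustrative values only: posterior relevance 0.3, generality 0.01.
r1, g = 0.3, 0.01
eta_star = stationary_hit_rate(r1, g)
eta_iter = iterate_hit_rate(r1, g)

# The crawler beats chance exactly when R(1) > G, i.e. when the
# likelihood factor R(1) / G exceeds 1.
beats_chance = eta_star > g
```

With these numbers the stationary hit rate is about 0.0141, above the generality of 0.01; setting the posterior equal to the generality makes the two coincide, and a posterior below the generality drives the hit rate below chance.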
Definition 7 can be generalized to likelihood factors over larger neighbourhoods:

$$\lambda(q, d) = \frac{R_q(d)}{G_q} \qquad (11)$$

and a stronger version of the conjecture can be formulated as follows:

$$\lambda(q, d) \gg 1 \quad \text{for} \quad \delta(q, d) < \delta^* \qquad (12)$$

where $\delta^*$ is a critical link distance beyond which semantic inferences are unreliable.



I first attempted to measure the likelihood factor $\lambda(q, 1)$ for a few queries and found that
$\langle \lambda(q, 1) \rangle_q \gg 1$, but those estimates were based on very noisy relevance assessments. To obtain a
reliable quantitative validation of the stronger link-cluster conjecture, I repeated such measurements
on the data set described in Figure 1.

The 300 measures of $\lambda(q,d)$ thus obtained are plotted versus $\delta(q,d)$ from Equation 2 in Figure 4.
Closeness to a relevant page in link space is highly predictive of relevance, increasing the relevance
probability by a likelihood factor $\lambda(q, d) \gg 1$ over the range of observed distances and queries.

We also performed a nonlinear least-squares fit of these data to a family of exponential decay
functions using the 300 points as independent samples:

$$\lambda(\delta) \sim 1 + \alpha_3 \, e^{-\alpha_4 \delta^{\alpha_5}}. \qquad (13)$$

Note that this three-parameter model is more complex than the one in Equation 4 because $\lambda(\delta = 0)$
must also be estimated from the data ($\lambda(q, 0) = 1/G_q$). The relationship between link distance and
the semantic likelihood factor is less regular than between link distance and lexical similarity. The
resulting fit (also shown in Figure 4) provides us with a rough estimate of how far in link space
we can make inferences about the semantics (relevance) of pages, i.e., up to a critical distance $\delta^*$
between 4 and 5 links.



It is surprising that the link-content and link-cluster conjectures have not been formalized and
addressed explicitly before, especially when one looks at the considerable attention recently received
by the Web's graph topology. The correlation between Web links and content takes on additional
significance in light of link analysis studies that tell us the Web is a "small world" network, i.e.,
a graph with an inverse power law distribution of in-links and out-links. Small world networks
have a mixture of non-random local structure and non-local random links. Such a topology creates
short paths between pages, whose length scales logarithmically with the number of Web pages. The
present results indicate that the Web's local structure is created by the semantic clusters resulting
from authors linking their pages to related resources.

The link-cluster and link-content conjectures have important normative implications for future Web
search technology. For example, the measurements in this paper suggest that topic-driven crawlers
should keep track of their position with a bias to remain within a few links from some relevant
source. In such a range hyperlinks create detectable signals about lexical and semantic content,
despite the Web's chaotic lack of structure. Absent such signals, the short paths predicted by the
small world model might be very hard to locate for localized algorithms. In general, the present
findings should foster the design of better search tools by integrating traditional search engines with
topic- and query-driven crawlers guided by local link and lexical clues. Smart crawlers of this
kind are already emerging (see for example http://myspiders.biz.uiowa.edu). Due to the size
and dynamic nature of the Web, the efficiency-motivated search engine practice of keeping query
processing separate from crawling leads to poor trade-offs between coverage and recency. Closing
the loop from user queries to smart crawlers will lead to dynamic indices with more scalable and
user-driven update algorithms than the centralized ones used today.
Figure 1: Representation of the data collection. 100 topic pages were chosen in the Yahoo directory
owing to this portal's wide popularity. Yahoo category pages are marked "Y", external pages are
marked "W". The topic pages were chosen among "leaf" categories, i.e., without sub-categories.
This way the external pages linked by a topic page ("Yq") represent the relevant set compiled
for that topic by the Yahoo editors (shaded). Topics were selected in breadth-first order and
therefore covered the full spectrum of Yahoo top-level categories. In this example the topic is
SOCIETY CULTURE BIBLIOGRAPHY. Arrows represent hyperlinks and dotted arrows are examples of
links pointing back to the relevant set. For each topic, we performed a breadth-first crawl up to
a depth of 3 links. The crawl set is represented inside the dashed line. To obtain meaningful and
comparable statistics at $\delta_l = 1$, only topic pages with at least 5 external links were used, and only
the first 10 links for topic pages with over 10 links. Each crawl was stopped if 10,000 pages had
been downloaded at depth $\delta_l = 3$ from the start page. A timeout of 60 seconds was applied for
each page. The resulting collection comprised 376,483 pages. The text of each fetched page was
parsed to extract links and terms. Terms were conflated using a standard stemming algorithm.
A common TFIDF weighting scheme was employed to represent each page in word vector space.
This model assumes a global measure of term frequency across pages (inverse document frequency).
To make the measures scalable with the maximum crawl depth (a parameter), inverse document
frequency was computed as a function of distance from the start page, among the set of documents
within that distance from the source. Formally, for each topic $q$, page $p$, term $k$ and depth $d$:
$w_{p,d,q}^k = \mathrm{tf}(k,p) \cdot \mathrm{idf}(k,d,q)$, where $\mathrm{tf}(k,p)$ is the number of occurrences of term $k$ in page $p$ and
$\mathrm{idf}(k,d,q) = 1 + \ln\left(N_d^q / N_d^q(k)\right)$. Here $N_d^q$ is the size of the cumulative page set $P_d^q = \{p : \delta_l(q,p) \le d\}$,
and $N_d^q(k)$ is the size of the subset of $P_d^q$ of pages containing term $k$.
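
The depth-dependent weighting described in the caption of Figure 1 can be sketched as follows. The example assumes pages are already tokenized and stemmed into lists of terms, and that the cumulative page set for a given topic and depth is passed in as a list that includes the page being weighted; the function names and toy pages are hypothetical.

```python
import math
from collections import Counter

def idf(term, pages_within_d):
    """Inverse document frequency over the pages within depth d of the topic."""
    n_total = len(pages_within_d)
    n_with_term = sum(1 for terms in pages_within_d if term in terms)
    return 1.0 + math.log(n_total / n_with_term)

def tfidf_vector(page_terms, pages_within_d):
    """Term weights for one page: raw term count times depth-dependent idf."""
    counts = Counter(page_terms)
    return {k: tf * idf(k, pages_within_d) for k, tf in counts.items()}

# Toy cumulative page set at some depth d (already stemmed term lists).
pages = [
    ["web", "search", "web"],
    ["web", "crawler"],
    ["link"],
]
vector = tfidf_vector(pages[0], pages)
```

Recomputing the idf over a larger page set as the depth grows changes the weights of the same page, which is why the weights carry the depth and topic indices.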
Figure 2: Scatter plot of $\sigma(q,d)$ versus $\delta(q,d)$ for topics $q = 0, \ldots, 99$ and depths $d = 1, 2, 3$.
Pearson's correlation coefficient $\rho = -0.76$, $p < 0.0001$. The similarity noise level $\sigma_\infty$ and an
exponential decay fit of the data are also shown. $\sigma_\infty$ was computed by comparing each topic page
to external pages linked from different Yahoo categories, giving $\sigma_\infty = 0.0318 \pm 0.0006$.
The regression yielded parametric estimates $\alpha_1$ ...
Figure 3: a. Scatter plot of $\sigma(q,d)$ versus $\delta(q,d)$ for topics $q = 0, \ldots, 99$ and depths $d = 1, 2, 3$, for
each of the major US top-level domains. The domain sets were obtained by simulating crawlers
that only follow links to servers within each domain. An exponential decay fit is also shown for each
domain. b. Exponential decay model parameters obtained by nonlinear least-squares fit of each
domain's data. c. Summary of statistically significant differences (at the 68.3% confidence level)
between the parametric estimates; dashed arrows represent significant differences in $\alpha_1$ only, and
solid arrows significant differences in both $\alpha_1$ and $\alpha_2$.
Figure 4: Scatter plot of $\lambda(q,d)$ versus $\delta(q,d)$ for topics $q = 0, \ldots, 99$ and depths $d = 1, 2, 3$.
Pearson's $\rho = -0.1$, $p = 0.09$. In computing $\lambda(q,d)$ from Definition 11, the relevant set $Q_q$ compiled
by the Yahoo editors for each topic $q$ was used to estimate $R_q(d)$ (cf. dotted links in
Figure 1). Generality was approximated by $G_q \simeq \ldots$, where all of the relevant links for
each topic $q$ are included in $Q'_q$, even for topics where only the first 10 links were used in the crawl
($Q'_q \supseteq Q_q$), and the set $Y$ in the denominator includes all Yahoo leaf categories. An exponential
decay fit of the data is also shown. The regression yielded parametric estimates $\alpha_3 \approx 1000$,
$\alpha_4 \approx 0.002$ and $\alpha_5 \approx 5.5$.